Objectives

What are the 100 highest total wage zip codes? lowest? what are the 100 highest average wage zip codes? lowest? how does the 100 total wages of the zip codes differ from the 100 average wages of the zipcodes? top 3 states with the highest/lowest wages? Make clear concise graphs and compare to other types of existing data Create a machine learning model to predict future data

Load The Neccesary libraries

library(tidyverse)
library(maps)
library(mapdata)
library(plotly)
library(caTools)
library(reshape2)

Examine data

wages <- read_csv("free-zipcode-database.csv")
Rows: 81831 Columns: 20
-- Column specification ---------------------------------------------------------------------------------------------------
Delimiter: ","
chr (10): Zipcode, ZipCodeType, City, State, LocationType, WorldRegion, Country, LocationText, Location, Notes
dbl  (9): RecordNumber, Lat, Long, Xaxis, Yaxis, Zaxis, TaxReturnsFiled, EstimatedPopulation, TotalWages
lgl  (1): Decommisioned

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wages)
Rows: 81,831
Columns: 20
$ RecordNumber        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26~
$ Zipcode             <chr> "00704", "00704", "00704", "00704", "00704", "00704", "00704", "00704", "00705", "00705", "00~
$ ZipCodeType         <chr> "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD", "STANDARD~
$ City                <chr> "PARC PARQUE", "PASEO COSTA DEL SUR", "SECT LANAUSSE", "URB EUGENE RICE", "URB GONZALEZ", "UR~
$ State               <chr> "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR", "PR~
$ LocationType        <chr> "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NOT ACCEPTABLE", "NO~
$ Lat                 <dbl> 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 17.96, 18.14, 18.14, 18.14, 18.14, 18.14, 18~
$ Long                <dbl> -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.22, -66.26, -66.26, -66.26, -66.2~
$ Xaxis               <dbl> 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.3~
$ Yaxis               <dbl> -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.87, -0.86, -0.86, -0.86, -0.86, -0.86, -0~
$ Zaxis               <dbl> 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31, 0.3~
$ WorldRegion         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ Country             <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US~
$ LocationText        <chr> "Parc Parque, PR", "Paseo Costa Del Sur, PR", "Sect Lanausse, PR", "Urb Eugene Rice, PR", "Ur~
$ Location            <chr> "NA-US-PR-PARC PARQUE", "NA-US-PR-PASEO COSTA DEL SUR", "NA-US-PR-SECT LANAUSSE", "NA-US-PR-U~
$ Decommisioned       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
$ TaxReturnsFiled     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ EstimatedPopulation <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ TotalWages          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ Notes               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

Looking at United States only

wages <- wages %>% filter(Country == "US")

Removing unwanted variables

wages <- wages %>% select(-RecordNumber,-Xaxis,-Yaxis,-Zaxis,-Location,-WorldRegion,-LocationText,-LocationType,-Country,-Decommisioned,-Notes)

Removing Unwanted cases like puerto rico and NA values and duplicates

unique(wages$State)
 [1] "PR" "NJ" "NY" "VI" "MA" "ME" "NH" "VT" "CT" "RI" "DE" "PA" "WV" "KY" "TN" "VA" "GA" "IN" "OH" "IL" "IA" "MN" "WI"
[24] "MT" "ND" "SD" "KS" "MO" "NE" "CO" "WY" "ID" "UT" "AZ" "NM" "TX" "CA" "NV" "OR" "WA" "AK" "GU" "HI" "AS" "PW" "FM"
[47] "MP" "MH" "FL" "SC" "AL" "MS" "LA" "AR" "OK" "MI" "DC" "MD" "NC" "AE" "AA" "AP"
wages <- wages %>% filter(State != "PR")

for(i in 1:9){
  wages<- wages %>% filter(!is.na(wages[i]))
}

wages <- wages %>% distinct(Zipcode, .keep_all = TRUE)

We see a loss of about 50,000 values in the data set. About 25,000 lost from blank values and another 25,000 from duplicates. Since this is the majority of the data set, it is not a very clean data set and may not have a very accurate representation of the initial data. However, I will still plan to analyse the data in a comprehensive way in order to find answers to the proposed questions.

Create an average wage column by dividing the total wage by the estimated population and formatting for the dollar amounts

wages <- wages %>% mutate(averageWage = format(round(TotalWages/EstimatedPopulation,2), nsmall = 2))
wages <- wages %>% mutate(averageWage = TotalWages/EstimatedPopulation)

EXporting a csv file in order to use the cleaned data in other programs i.e. Tableau

write.table(wages,file = "CleanWages.csv",row.names = F,sep = ",")

What is the 100 highest wage zip codes?

options(scipen = 999)
highestwage <- wages %>% arrange(-TotalWages)
highestwage <- highestwage %>% mutate(rank = 1:28844)
highestwage <- highestwage %>% filter(rank <= 100)
print(head(highestwage))

highestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of High Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")

Looking at the graph of each state in the top 100 and the Total wages of the zip codes in each state, it is easy to see that California, New York, Texas, and Illinois contribute the majority of the zip codes with the highest wages.

options(scipen = 999)
highestwage %>% ggplot(aes(x = TotalWages/1000000000)) + geom_density() + labs(title = "Density of Total Wages in the Top 100" ,x = "Total Wages (Billion $)",y = "Density") 

By taking a look at the graph of the density of each price we can see that the majority of zip codes in the top 100 highest wages are around 1.7 billion dollars. Also as you get past 1.9 billion dollars the amount of zip codes decreases steadily.

What is the 100 lowest wage zip codes?

lowestwage <- wages %>% arrange(TotalWages)
lowestwage <- lowestwage %>% mutate(rank = 1:28844)
lowestwage <- lowestwage %>% filter(rank <= 100)
print(head(lowestwage))
lowestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Low Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")

Here we see that Michigan, Arizona, and Texas are the 3 most frequent states in the Bottom 100.

options(scipen = 999)
lowestwage %>% ggplot(aes(x = TotalWages/1000000)) + geom_density() + labs(title = "Density of Total Wages in the Bottom 100" ,x = "Total Wages (Million $)",y = "Density") 

Similar to the last density chart, this graph predicts the majority of wages in the bottom 100 are around $4.5 million.

What is the 100 highest average wage zip codes?

highestavgwage <- wages %>% arrange(-averageWage)
highestavgwage <- highestavgwage %>% mutate(rank = 1:28844)
highestavgwage <- highestavgwage %>% filter(rank <= 100)
print(head(highestavgwage))
highestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")

The average zip code wage in each state shows a different story than the total wages in each zip code. The two states that make up the majority are New York and California, where New York is much more frequent.

options(scipen = 999)
highestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Top 100" ,x = "Total Wages (Thousand $)",y = "Density") 

The densities show that the majority of the average zip code wages in the top 100 are around 190 thousand dollars.

What is the 100 lowest average wage zip codes?

lowestavgwage <- wages %>% arrange(averageWage)
lowestavgwage <- lowestavgwage %>% mutate(rank = 1:28844)
lowestavgwage <- lowestavgwage %>% filter(rank <= 100)
print(head(lowestavgwage))
lowestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")

Similar to the lowest total wages, Michigan and Arizona are the highest contributors to the bottom 100 average zip code wages.

options(scipen = 999)
lowestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Bottom 100" ,x = "Total Wages (Thousand $)",y = "Density") 

Here the majority of the average wages in the bottom 100 are around 7.5 thousand dollars.

What are the 3 states with the highest wages by total?

statewage <- wages %>% group_by(State) %>% summarise(
  totalstatewages = sum(TotalWages))
highstatewage <- statewage %>% arrange(-totalstatewages)
highstatewage <- highstatewage %>% mutate(rank = 1:51)
highstatewage <- highstatewage %>% filter(rank <= 3)
print(head(highstatewage))

These results support the previous conclusion from Graph 1 where we saw that California Texas and New York were among the main contributors of the highest state wages. However these results give us further insight that Illinois is not actually among the top 3 highest wage states when looking at the total wages

What are the 3 states with the lowest wages by total?

lowstatewage <- statewage %>% arrange(totalstatewages)
lowstatewage <- lowstatewage %>% mutate(rank = 1:51)
lowstatewage <- lowstatewage %>% filter(rank <= 3)
print(head(lowstatewage))

What does the average of each state look like?

state_avg_wage <- wages %>% 
  group_by(State) %>% 
  summarise(avgstatewages = mean(TotalWages))

plot_geo(data = state_avg_wage,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~state_avg_wage$avgstatewages,
            zmin = min(state_avg_wage$avgstatewages), 
            zmax = max(state_avg_wage$avgstatewages),
            color = state_avg_wage$avgstatewages) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nAverage Wages in the United States by State") %>% colorbar(tickprefix = "$")

The graph above shows that out of all the states, the highest wage location is Washington DC with an average wage of about $550 Million. The second two locations are California and New Jersey by a $150 million wage gap. In comparison to all other Locations, these three stand out as states and territories with high wages.

Showing the Highest Average Wages in Tabular Form

state_avg_wage <- state_avg_wage %>% 
  arrange(-avgstatewages ) 
head(state_avg_wage)

Importing Other Datasets From Online For Further Analysis

CostOfLiving <- read.csv("Cost of Living.csv")
StateAbrev <- read.csv("StateAbrev.csv")

Preparing Data For Joining

StateAbrev <- StateAbrev %>% rename(State = USPS.Abbreviation)
CostOfLiving <- CostOfLiving %>% rename(State.Name = State)

Joining the Datasets

CostOfLiving <- CostOfLiving %>% full_join(StateAbrev, by = "State.Name")
State_avg_vs_COL <- state_avg_wage %>% full_join(CostOfLiving, by = "State")
State_avg_vs_COL <- State_avg_vs_COL %>% select(-Rank,-State.Name)

Working With the New Dataset To Determine Potential Relationships

plot_geo(data = State_avg_vs_COL,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~State_avg_vs_COL$Index,
            zmin = min(State_avg_vs_COL$Index), 
            zmax = max(State_avg_vs_COL$Index),
            color = State_avg_vs_COL$Index) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nCost of living Index in the United States by State")

Upon inspection of the Cost of Living map in comparison to the Averages Wages map, there are some clear trends. California and DC Remain in the top 3 in both maps. However Hawaii has a much higher cost of living than compared to its average wage. This is most likely due to its status as a “vacation state”. The rest of the states are kind of ambiguous when looking at the choropleth map. Further inspection of the correlation will give us an idea of the relationship.

Testing Correlation in order to quantify the relationship

cor(State_avg_vs_COL$Index, State_avg_vs_COL$avgstatewages, use = "pairwise.complete.obs")
[1] 0.5894002

Here we see that there is a moderately strong positive correlation between the two variables. Intuitively this is not a surprising discovery, however I will make a correlation matrix to see which of the factors that contribute to the Cost of living carry more weight when looking at the average wage in each state.

Correlation Matrix

cor_matrix <- State_avg_vs_COL %>%
  select(-State) %>%
  cor(use = "pairwise.complete.obs")
cor_matrix <- round(cor_matrix, digits = 2)

meltCorMat <- melt(cor_matrix)

meltCorMat %>% ggplot(aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)

Here we can see that the two biggest correlations other than The total cost of living index is the housing price index and a misc. index which I summarize to mean recreational activities and commodities such as eating out and entertainment systems.

Creating scatter map based on longitude and latitude

geo_prop <- list(scope = 'usa',
                 projection = list(type = 'albers usa'), 
                 showland = TRUE,
                 showsubunits = TRUE,
                 landcolor = toRGB('gray10'),
                 showlakes = TRUE, 
                 lakecolor = toRGB('white'))

plot_geo(wages, 
        lat = ~Lat,
        lon = ~Long,
        marker = list(size = wages$averageWage/15000),
        text = wages$City) %>% layout(geo = geo_prop, title = "\nDensity and Intensity of the Average Wage for US Zip Codes")
No scattergeo mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
No scattergeo mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

The Map above shows all of the zip codes plotted via their latitude and longitude. The size or intensity of every point is proportional to how high the average wage in the zip code is. I just figured I would plot this because its a good looking graph and can express the range of zip codes left in the data after it had been cleaned.

Training A Linear Regression Model to Predict Average Wages In a Zip Code

Training the Model

LineWage <- wages %>% select(-State,-City,-ZipCodeType,-Zipcode)

set.seed(2)
split <- sample.split(LineWage,SplitRatio = 1/4)
train <- subset(LineWage, split = "TRUE")
test <- subset(LineWage, split = "FALSE")

model <- lm(averageWage~.,data = train)
summary(model)

Call:
lm(formula = averageWage ~ ., data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-34918  -3332   -717   1722 167551 

Coefficients:
                            Estimate       Std. Error t value             Pr(>|t|)    
(Intercept)         20594.1304072651   409.3614252313  50.308 < 0.0000000000000002 ***
Lat                    29.3502510222     8.1391765199   3.606             0.000311 ***
Long                   26.8223086918     2.8040993302   9.565 < 0.0000000000000002 ***
TaxReturnsFiled         0.3766591942     0.0677842110   5.557         0.0000000277 ***
EstimatedPopulation    -1.3068716836     0.0397504079 -32.877 < 0.0000000000000002 ***
TotalWages              0.0000537401     0.0000003298 162.923 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7090 on 28838 degrees of freedom
Multiple R-squared:  0.5078,    Adjusted R-squared:  0.5077 
F-statistic:  5951 on 5 and 28838 DF,  p-value: < 0.00000000000000022

Testing the Model

predict <- predict(model, test)

Graphing for Accuracy

plot(predict, type = "l",col = "red") + lines(test$averageWage)
integer(0)

Calculating Root Mean Square Error for Accuracy

rmse <- sqrt(mean(predict-LineWage$averageWage)^2)
print(rmse)
[1] 0.00000000001681188

The error calculation is very low; this indicates a well trained model for future data.

---
title: "Wages Via Zip Code"
author: "Elijah Silfies"
date: "11/11/2021"
output: html_notebook
---
### Objectives
What are the 100 highest total wage zip codes? lowest?
what are the 100 highest average wage zip codes? lowest?
how does the 100 total wages of the zip codes differ from the 100 average wages of the zipcodes?
top 3 states with the highest/lowest wages?
Make clear concise graphs and compare to other types of existing data
Create a machine learning model to predict future data



### Load The Neccesary libraries
```{r}
library(tidyverse)
library(maps)
library(mapdata)
library(plotly)
library(caTools)
library(reshape2)
```

### Examine data
```{r}
wages <- read_csv("free-zipcode-database.csv")
glimpse(wages)
```

### Looking at United States only
```{r}
wages <- wages %>% filter(Country == "US")
```

### Removing unwanted variables
```{r}
wages <- wages %>% select(-RecordNumber,-Xaxis,-Yaxis,-Zaxis,-Location,-WorldRegion,-LocationText,-LocationType,-Country,-Decommisioned,-Notes)
```

### Removing Unwanted cases like puerto rico and NA values and duplicates
```{r}
unique(wages$State)

wages <- wages %>% filter(State != "PR")

for(i in 1:9){
  wages<- wages %>% filter(!is.na(wages[i]))
}

wages <- wages %>% distinct(Zipcode, .keep_all = TRUE)
```
We see a loss of about 50,000 values in the data set. About 25,000 lost from blank values and another 25,000 from duplicates. Since this is the majority of the data set, it is not a very clean data set and may not have a very accurate representation of the initial data. However, I will still plan to analyse the data in a comprehensive way in order to find answers to the proposed questions.

### Create an average wage column by dividing the total wage by the estimated population and formatting for the dollar amounts
```{r}
wages <- wages %>% mutate(averageWage = format(round(TotalWages/EstimatedPopulation,2), nsmall = 2))
wages <- wages %>% mutate(averageWage = TotalWages/EstimatedPopulation)
```

### EXporting a csv file in order to use the cleaned data in other programs i.e. Tableau
```{r}
write.table(wages,file = "CleanWages.csv",row.names = F,sep = ",")
```


### What is the 100 highest wage zip codes?
```{r}
options(scipen = 999)
highestwage <- wages %>% arrange(-TotalWages)
highestwage <- highestwage %>% mutate(rank = 1:28844)
highestwage <- highestwage %>% filter(rank <= 100)
print(head(highestwage))
```



```{r}

highestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of High Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")
```
Looking at the graph of each state in the top 100 and the Total wages of the zip codes in each state, it is easy to see that California, New York, Texas, and Illinois contribute the majority of the zip codes with the highest wages.

```{r}
options(scipen = 999)
highestwage %>% ggplot(aes(x = TotalWages/1000000000)) + geom_density() + labs(title = "Density of Total Wages in the Top 100" ,x = "Total Wages (Billion $)",y = "Density") 
```
By taking a look at the graph of the density of each price we can see that the majority of zip codes in the top 100 highest wages are around 1.7 billion dollars. Also as you get past 1.9 billion dollars the amount of zip codes decreases steadily.

### What is the 100 lowest wage zip codes?
```{r}
lowestwage <- wages %>% arrange(TotalWages)
lowestwage <- lowestwage %>% mutate(rank = 1:28844)
lowestwage <- lowestwage %>% filter(rank <= 100)
print(head(lowestwage))
```

```{r}
lowestwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Low Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")
```
Here we see that Michigan, Arizona, and Texas are the 3 most frequent states in the Bottom 100.


```{r}
options(scipen = 999)
lowestwage %>% ggplot(aes(x = TotalWages/1000000)) + geom_density() + labs(title = "Density of Total Wages in the Bottom 100" ,x = "Total Wages (Million $)",y = "Density") 
```
Similar to the last density chart, this graph predicts the majority of wages in the bottom 100 are around $4.5 million.


### What is the 100 highest average wage zip codes?
```{r}
highestavgwage <- wages %>% arrange(-averageWage)
highestavgwage <- highestavgwage %>% mutate(rank = 1:28844)
highestavgwage <- highestavgwage %>% filter(rank <= 100)
print(head(highestavgwage))
```

```{r}
highestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Top 100" ,x = "State",y = "Frequency")
```
The average zip code wage in each state shows a different story than the total wages in each zip code. The two states that make up the majority are New York and California, where New York is much more frequent.

```{r}
options(scipen = 999)
highestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Top 100" ,x = "Total Wages (Thousand $)",y = "Density") 
```
The densities show that the majority of the average zip code wages in the top 100 are around 190 thousand dollars. 

### What is the 100 lowest average wage zip codes?
```{r}
lowestavgwage <- wages %>% arrange(averageWage)
lowestavgwage <- lowestavgwage %>% mutate(rank = 1:28844)
lowestavgwage <- lowestavgwage %>% filter(rank <= 100)
print(head(lowestavgwage))
```

```{r}
lowestavgwage %>% ggplot(aes(x = State,fill = State)) + geom_bar() + labs(title = "Number of Average Wage Zip Codes Per State in the Bottom 100" ,x = "State",y = "Frequency")
```
Similar to the lowest total wages, Michigan and Arizona are the highest contributors to the bottom 100 average zip code wages.


```{r}
options(scipen = 999)
lowestavgwage %>% ggplot(aes(x = averageWage/1000)) + geom_density() + labs(title = "Density of Average Wages in the Bottom 100" ,x = "Total Wages (Thousand $)",y = "Density") 
```
Here the majority of the average wages in the bottom 100 are around 7.5 thousand dollars.

### What are the 3 states with the highest wages by total?
```{r}
statewage <- wages %>% group_by(State) %>% summarise(
  totalstatewages = sum(TotalWages))
highstatewage <- statewage %>% arrange(-totalstatewages)
highstatewage <- highstatewage %>% mutate(rank = 1:51)
highstatewage <- highstatewage %>% filter(rank <= 3)
print(head(highstatewage))
```
These results support the previous conclusion from Graph 1 where we saw that California Texas and New York were among the main contributors of the highest state wages. However these results give us further insight that Illinois is not actually among the top 3 highest wage states when looking at the total wages


### What are the 3 states with the lowest wages by total?
```{r}
lowstatewage <- statewage %>% arrange(totalstatewages)
lowstatewage <- lowstatewage %>% mutate(rank = 1:51)
lowstatewage <- lowstatewage %>% filter(rank <= 3)
print(head(lowstatewage))
```


### What does the average of each state look like?
```{r}
state_avg_wage <- wages %>% 
  group_by(State) %>% 
  summarise(avgstatewages = mean(TotalWages))

plot_geo(data = state_avg_wage,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~state_avg_wage$avgstatewages,
            zmin = min(state_avg_wage$avgstatewages), 
            zmax = max(state_avg_wage$avgstatewages),
            color = state_avg_wage$avgstatewages) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nAverage Wages in the United States by State") %>% colorbar(tickprefix = "$")
```
The graph above shows that out of all the states, the highest wage location is Washington DC with an average wage of about $550 Million. The second two locations are California and New Jersey by a $150 million wage gap. In comparison to all other Locations, these three stand out as states and territories with high wages.

### Showing the Highest Average Wages in Tabular Form 
```{r}
state_avg_wage <- state_avg_wage %>% 
  arrange(-avgstatewages ) 
head(state_avg_wage)
```

### Importing Other Datasets From Online For Further Analysis
```{r}
CostOfLiving <- read.csv("Cost of Living.csv")
StateAbrev <- read.csv("StateAbrev.csv")
```

### Preparing Data For Joining
```{r}
StateAbrev <- StateAbrev %>% rename(State = USPS.Abbreviation)
CostOfLiving <- CostOfLiving %>% rename(State.Name = State)
```

### Joining the Datasets
```{r}
CostOfLiving <- CostOfLiving %>% full_join(StateAbrev, by = "State.Name")
State_avg_vs_COL <- state_avg_wage %>% full_join(CostOfLiving, by = "State")
State_avg_vs_COL <- State_avg_vs_COL %>% select(-Rank,-State.Name)
```

### Working With the New Dataset To Determine Potential Relationships
```{r}
plot_geo(data = State_avg_vs_COL,
                      locationmode = 'USA-states') %>% 
  add_trace(locations = ~State,
            z = ~State_avg_vs_COL$Index,
            zmin = min(State_avg_vs_COL$Index), 
            zmax = max(State_avg_vs_COL$Index),
            color = State_avg_vs_COL$Index) %>% 
  layout(geo = list(scope= 'usa'),
         title = "\nCost of living Index in the United States by State")
```
Upon inspection of the Cost of Living map in comparison to the Averages Wages map, there are some clear trends. California and DC Remain in the top 3 in both maps. However Hawaii has a much higher cost of living than compared to its average wage. This is most likely due to its status as a "vacation state". The rest of the states are kind of ambiguous when looking at the choropleth map. Further inspection of the correlation will give us an idea of the relationship.

### Testing Correlation in order to quantify the relationship
```{r}
cor(State_avg_vs_COL$Index, State_avg_vs_COL$avgstatewages, use = "pairwise.complete.obs")
```
Here we see that there is a moderately strong positive correlation between the two variables. Intuitively this is not a surprising discovery, however I will make a correlation matrix to see which of the factors that contribute to the Cost of living carry more weight when looking at the average wage in each state.

### Correlation Matrix
```{r}
cor_matrix <- State_avg_vs_COL %>%
  select(-State) %>%
  cor(use = "pairwise.complete.obs")
cor_matrix <- round(cor_matrix, digits = 2)

meltCorMat <- melt(cor_matrix)

meltCorMat %>% ggplot(aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)
```
Here we can see that the two biggest correlations other than The total cost of living index is the housing price index and a misc. index which I summarize to mean recreational activities and commodities such as eating out and entertainment systems.

### Creating scatter map based on longitude and latitude
```{r}
geo_prop <- list(scope = 'usa',
                 projection = list(type = 'albers usa'), 
                 showland = TRUE,
                 showsubunits = TRUE,
                 landcolor = toRGB('gray10'),
                 showlakes = TRUE, 
                 lakecolor = toRGB('white'))

plot_geo(wages, 
        lat = ~Lat,
        lon = ~Long,
        marker = list(size = wages$averageWage/15000),
        text = wages$City) %>% layout(geo = geo_prop, title = "\nDensity and Intensity of the Average Wage for US Zip Codes")
```
The Map above shows all of the zip codes plotted via their latitude and longitude. The size or intensity of every point is proportional to how high the average wage in the zip code is. I just figured I would plot this because its a good looking graph and can express the range of zip codes left in the data after it had been cleaned.



## Training A Linear Regression Model to Predict Average Wages In a Zip Code

### Training the Model
```{r}
LineWage <- wages %>% select(-State,-City,-ZipCodeType,-Zipcode)

set.seed(2)
split <- sample.split(LineWage,SplitRatio = 1/4)
train <- subset(LineWage, split = "TRUE")
test <- subset(LineWage, split = "FALSE")

model <- lm(averageWage~.,data = train)
summary(model)
```

### Testing the Model
```{r}
predict <- predict(model, test)
```

### Graphing for Accuracy
```{r}
plot(predict, type = "l",col = "red") + lines(test$averageWage)
```

### Calculating Root Mean Square Error for Accuracy
```{r}
rmse <- sqrt(mean(predict-LineWage$averageWage)^2)
print(rmse)
```
The error calculation is very low; this indicates a well trained model for future data.


### Sources

1. Kaggle:   https://www.kaggle.com/pavansanagapati/us-wages-via-zipcode
2. YourDictionary:   https://abbreviations.yourdictionary.com/articles/state-abbrev.html
3. MERIC:   https://meric.mo.gov/data/cost-living-data-series
